First load required packages and set some global parameters.
Code
library(tidyverse)library(brms)library(tidyboot)library(tidyjson)library(patchwork)library(GGally)library(cowplot)library(BayesFactor)library(aida) # custom helpers: https://github.com/michael-franke/aida-packagelibrary(faintr) # custom helpers: https://michael-franke.github.io/faintr/index.htmllibrary(cspplot) # custom styles: https://github.com/CogSciPrag/cspplotlibrary(ordbetareg) # for ordered beta-regressionlibrary(cmdstanr)################################################### these options help Stan run fasteroptions(mc.cores = parallel::detectCores(),brms.backend ="cmdstanr")# use the CSP-theme for plottingtheme_set(theme_csp())# global color scheme from CSPproject_colors = cspplot::list_colors() |>pull(hex)# setting theme colors globallyscale_colour_discrete <-function(...) {scale_colour_manual(..., values = project_colors)}scale_fill_discrete <-function(...) {scale_fill_manual(..., values = project_colors)}##################################################rerun_models <-FALSELambert_test <-function(loo_comp) {1-pnorm(-loo_comp[2,1], loo_comp[2,2])}##################################################rl_file <-"R_data_4_TeX/myvars.csv"myvars =list()# add with: myvars[name] = value
Read & massage the data
The data is pre-processed (using Python). We load it and rearrange for convenience.
Code
d <-read_csv("../results/round_2.0/results_preprocessed.csv") |># drop column with numbersselect(-`...1`) |># set "non-answers" to AnswerPolarity "positive"mutate(AnswerPolarity =ifelse( AnswerCertainty =="non_answer", "positive", AnswerPolarity)) |># casting into factormutate(group =factor(group, levels =c("relevant", "helpful")),ContextType =factor(ContextType, levels =c("negative", "neutral", "positive")),AnswerPolarity =factor(AnswerPolarity, levels =c("positive", "negative")),AnswerCertainty =factor(AnswerCertainty, levels =c("non_answer", "low_certainty", "high_certainty", "exhaustive")) )
Data exclusion
We exclude all data from participants who:
score less than perfect on all attention checks,
scored less than 0.5 on reasoning tasks, or
has task-sensitivity of not more than 0.75
Task sensitivity is the proportion of critical trials, excluding non-answer trials, in which the change between prior and posterior rating was bigger than 0.05 or there was a non-zero change in commitment rating.
group: whether a participant rated the ‘helpfulness’ or the ‘relevance’ of the answer (between-subjects variable)
ContextType: whether the context made a ‘no’ or a ‘yes’ answer more likely /a priori/ or whether it was neutral (within-subjects)
AnswerCertainty: how much information the answer provides towards a fully resolving answer (within-subjects)
AnswerPolarity: whether the answer suggests or implies a ‘no’ or a ‘yes’ answer (within-subjects)
‘non-answers’ are treated as ‘positive’ for technical purposes, but this does not influence relevant analyses
In the following, we first check whether these experimental manipulations worked as intended.
Sanity-checking whether the manipulations worked as intended
Effects of ContextType on prior and prior commitment
To check whether the ContextType manipulation worked, we compare how participants rated the prior probability of a ‘yes’ answer under each level of the ContextType variable. Concretely, we expect this order of prior ratings for the levels of ContextType: negative < neutral < positive. Although we have no specific hypotheses or sanity-checking questions regarding the commitment ratings, let’s also scrutinize the commitment ratings that participants gave with their prior ratings.
Prior ratings as a function of ContextType
Here is a first plot addressing the question after an effect of ContextType on participants prior ratings.
Code
d |>ggplot(aes(x = prior_sliderResponse, color = ContextType, fill = ContextType)) +geom_density(alpha =0.3) +xlab("prior rating") +ylab("")
We dive deeper by fitting a regression model, predicting prior ratings in terms of the ContextType. Since participants have not seen the answer when they rate the prior probability of a ‘yes’ answer, ContextType is the only fixed effect we should include here. The model also includes the maximal RE structure. We use the ordbetareg package for (slider-data appropriate) zero-one inflated ordinal beta regression.
Our assumption is that prior ratings are higher in contexts classified as ‘neutral’ than in ‘negative’ contexts, and yet higher in ‘positive’ contexts. We use the faintr package to extract information on these directed comparisons.
To scrutinize the effect of ContextType on participants expressed commitment in their prior ratings, we use a ordered-logit regression (since prior commitment ratings are from a rating scale).
The results of these comparisons are summarized here:
comparison
measure
posterior
HDI-low
HDI-high
negative < neutral
prior
0.9973333
0.292325
1.134670
neutral < positive
prior
0.9998333
0.507056
1.558420
neutral < negative
prior-commitment
0.9891667
0.131985
1.284680
negative < positive
prior-commitment
0.7971667
-0.505641
1.205520
neutral < positive
prior-commitment
0.9958333
0.347629
1.798132
The ContextType manipulation seems to have worked as expected for the prior ratings: lower in ‘negative’ than in ‘neutral’ than in ‘positive’. There is no support for differences in the commitment ratings, except that the positive context case seems to induce more commitment than the neutral context.
Effects of AnswerPolarity and AnswerCertainty on beliefChange
We can define beliefChange as the difference between posterior and prior in the direction expected from the answer’s polarity (posterior belief in ‘yes’ answer increases for a ‘positive’ answer when compared with the prior rating, but it decreases for ‘negative’ answers). (Careful: we ignore non-answers (which are categorized as “positive” for technical convenience only).) If our manipulation worked, we expect the following for both ‘positive’ and ‘negative’ polarity:
beliefChange is > 0
beliefChange is lower for ‘low certainty’ than for ‘high certainty’ than for ‘exhaustive’
To address the first issue, whether beliefChange is positive for both types of polartiy, we first regress beliefChange against the full list of potentially relevant factors, including plausible RE structures. Notice that at the time of answer the questions related to the posterior, participants have not yet seen the question after relevance or helpfulness, so that factor group should be ommitted.
Code
# TODO: strictly speaking, this is data from a bounded scale; different regression model would be appropiateif (rerun_models) { fit_answer_SC <-brm(formula = beliefChange ~ ContextType * AnswerCertainty * AnswerPolarity + (1+ ContextType + AnswerCertainty + AnswerPolarity | StimID) + (1+ ContextType + AnswerCertainty + AnswerPolarity | submission_id),data = d |>filter(AnswerCertainty !="non_answer") |>mutate(beliefChange = posterior_sliderResponse - prior_sliderResponse,beliefChange =ifelse(AnswerPolarity =="positive", beliefChange, - beliefChange)) )saveRDS(fit_answer_SC, "cachedModels-round2/fit_answer_SC.Rds")} else{ fit_answer_SC <-readRDS("cachedModels-round2/fit_answer_SC.Rds")}
We check if inferred cell means are credibly bigger than zero, for all six relevant design cells (facets in the plot above).
Code
# 1. Check if belief change in each cell is bigger than zerocellDraws_answers <-tibble(low_pos =filter_cell_draws( fit_answer_SC, AnswerCertainty =="low_certainty"& AnswerPolarity =="positive", "low_pos")$low_pos,high_pos =filter_cell_draws( fit_answer_SC, AnswerCertainty =="high_certainty"& AnswerPolarity =="positive", "high_pos")$high_pos,exh_pos =filter_cell_draws( fit_answer_SC, AnswerCertainty =="exhaustive"& AnswerPolarity =="positive", "exh_pos")$exh_pos,low_neg =filter_cell_draws( fit_answer_SC, AnswerCertainty =="low_certainty"& AnswerPolarity =="negative", "low_neg")$low_neg,high_neg =filter_cell_draws( fit_answer_SC, AnswerCertainty =="high_certainty"& AnswerPolarity =="negative", "high_neg")$high_neg,exh_neg =filter_cell_draws(fit_answer_SC, AnswerCertainty =="exhaustive"& AnswerPolarity =="negative", "exh_neg")$exh_neg) # all posterior 95% HDIs are wayabove 0 apply( cellDraws_answers |>as.matrix(), MARGIN =2, aida::summarize_sample_vector)
$low_pos
# A tibble: 1 × 4
Parameter `|95%` mean `95%|`
<chr> <dbl> <dbl> <dbl>
1 "" 0.0861 0.131 0.176
$high_pos
# A tibble: 1 × 4
Parameter `|95%` mean `95%|`
<chr> <dbl> <dbl> <dbl>
1 "" 0.202 0.257 0.306
$exh_pos
# A tibble: 1 × 4
Parameter `|95%` mean `95%|`
<chr> <dbl> <dbl> <dbl>
1 "" 0.396 0.447 0.499
$low_neg
# A tibble: 1 × 4
Parameter `|95%` mean `95%|`
<chr> <dbl> <dbl> <dbl>
1 "" 0.105 0.177 0.250
$high_neg
# A tibble: 1 × 4
Parameter `|95%` mean `95%|`
<chr> <dbl> <dbl> <dbl>
1 "" 0.206 0.282 0.356
$exh_neg
# A tibble: 1 × 4
Parameter `|95%` mean `95%|`
<chr> <dbl> <dbl> <dbl>
1 "" 0.301 0.363 0.428
Code
# posterior probability of mean bigger 0 for each cell is almost 1 everywhereapply(as.matrix(cellDraws_answers), MARGIN=2, function(x) {mean(x>0)})
These results suggest that there is little reason to doubt that the belief changes induces by the answers -as per the experimentally intended manipulation- went in the right direction in all cases.
beliefChange increases with more informative answers
Finally, we investigate whether beliefChange increases with more informative answers, using the same regression model as before.
We see no indication of a main effect of polarity, but find support for the idea that our manipulation of AnswerCertainty induced gradually larger belief changes. I sum, it seems that the stimuli were adequately created to implement the intended manipulation in the variables AnswerCertainty and AnswerPolarity.
Predicting relevance in terms of the experimental factors
We want to explore how relevance ratings depend on the experimental manipulations. First, we check whether the group variable (the trigger word: ‘helpful’ vs ‘relevant’) is important. If not, we can simplify subsequent analyses.
Next, we investigate the effects variables like AnswerCertaintyAnswerPolarity etc. on relevance ratings.
Can we gloss over the different trigger words?
To simplify analyses, it would be helpful to know whether we can gloss over the group manipulation. So, does it matter whether participants were asked to rate relevance or helpfulness?
To start with, let’s just look at whether there is a main effect, which there is not (possibly also partially explained away by by-subject random slopes):
To further investigate this contrast, we may fit two beta regression models, one with and one without the group factor. We check whether there is a credible main effect of group in the full model and a significant difference in LOO score when comparing these models. The former is for fun, the latter determines whether we should lump trigger words together.
Code
if (rerun_models) {# TODO too small ESS fit_with_group_ordBeta <- ordbetareg::ordbetareg(# must omitt interactions in the REs to ensure proper fitformula = relevance_sliderResponse ~ group * ContextType * AnswerCertainty * AnswerPolarity + (1+ group + ContextType + AnswerCertainty + AnswerPolarity || StimID) + (1+ ContextType + AnswerCertainty + AnswerPolarity || submission_id),data = d,# set this prior, otherwise there are errorscoef_prior_SD =5,save_pars =save_pars(all=T) ) fit_without_group_ordBeta <- ordbetareg::ordbetareg( relevance_sliderResponse ~ ContextType * AnswerCertainty * AnswerPolarity + (1+ ContextType + AnswerCertainty + AnswerPolarity || StimID) + (1+ ContextType + AnswerCertainty + AnswerPolarity || submission_id),data = d,coef_prior_SD =5,save_pars =save_pars(all=TRUE) )saveRDS(fit_with_group_ordBeta, "cachedModels-round2/fit_with_group_ordBeta.Rds")saveRDS(fit_without_group_ordBeta, "cachedModels-round2/fit_without_group_ordBeta.Rds")} else { fit_with_group_ordBeta <-read_rds("cachedModels-round2/fit_with_group_ordBeta.Rds") fit_without_group_ordBeta <-read_rds("cachedModels-round2/fit_without_group_ordBeta.Rds")}
We inspect whether there is a main effect of group in the full model:
Code
# main effect of "group" ?group_main <-compare_groups( fit_with_group_ordBeta,higher = group =="relevant",lower = group =="helpful")print(group_main)
Outcome of comparing groups:
* higher: group == "relevant"
* lower: group == "helpful"
Mean 'higher - lower': 0.2069
95% HDI: [ 0.07441 ; 0.3409 ]
P('higher - lower' > 0): 0.9978
Posterior odds: 443.4
The posterior for the main factor of group is credibly different from zero.
But that is not the best criterion to decide whether the group distinction is relevant for predictions. We there also compare models with LOO cross-validation:
It appears that, when comparing these models with REs, the lesioned model is numerically even slightly better than the model including group, but not by a substantial margin. From a predictive point of view, including group does not add substantial value. We will therefore lump trigger words together in the following.
Effect of AnswerPolarity, AnswerCertainty and ContextType on relevance ratings
To investigate further which experimental factors influence the ratings of relevance of an answer, start by a visualization:
The table shows results indicating that there are (non-surprising) effects of AnswerType with non-answers rated as least relevant, followed by low-certainty, then high-certainty answers, and final exhaustive answers. There is no (strong) indication for a main effect of AnswerPolarity or ContextType. The lack of an effect of ContextType might be interpreted as prima facie evidence in favor of quantitative notions or relevance that do not take the prior into account (at least not very prominently).
Here is a plot of the relevant posterior draws visually supporting why we compared the three factor levels of ContextType in the way we did (positive is the lowest, neutral the highest, but this difference is still not strongly indicative of a difference (0 included in HDI)):
Research hypotheses 1 and 2 are basic predictions in terms of simple measures of first- and second-order belief change. Research hypothesis 3 is about different notions of quantifying informational relevance.
The hypothesis is that higher belief changes (induced by the answer) lead to higher relevance ratings. We test this hypothesis by a linear beta regression model (with maximal random effects) that regresses relevance ratings against the absolute difference between prior and posterior ratings (first_order_belief_change). We judge there to be evidence in favor of this hypothesis if the relevant slope coefficient is estimated to be credibly bigger than zero (posterior probability > 0.944; an arbitrary value to indicate that there is nothing special about 0.95) and a loo-based model comparison with an intercept only model substantially favors the model that includes the relevant slope.
Run the regression model:
Code
if (rerun_models) { fit_belief_diff <-ordbetareg(formula = relevance_sliderResponse ~ first_order_belief_change + (1+ first_order_belief_change | submission_id) + (1+ first_order_belief_change | StimID),data = d )# TODO check how to run Intercept-only model in ordbetareg fit_belief_diff_interceptOnly <-ordbetareg(formula = relevance_sliderResponse ~ .,data = d |>mutate(Int =1) )write_rds(fit_belief_diff, "cachedModels-round2/fit_belief_diff.rds")write_rds(fit_belief_diff_interceptOnly, "cachedModels-round2/fit_belief_diff_interceptOnly.rds")} else { fit_belief_diff <-read_rds("cachedModels-round2/fit_belief_diff.rds") fit_belief_diff_interceptOnly <-read_rds("cachedModels-round2/fit_belief_diff_interceptOnly.rds")}
Hypothesis 2: commitment change additionally contributes to relevance rating
We also hypothesize that change in commitment (second_order_belief_change) ratings additionally contributes to predicting relevance ratings. Concretely, we address this hypothesis with a linear beta regression model like for hypothesis 1, but also including the absolute difference in commitment ratings for before and after the answer (and the interaction term). We use the maximal RE-structure. We speak of evidence in favor of this hypothesis if the relevant posterior slope parameter is credibly bigger than zero and a loo-based model comparison favors the more complex model. We speak of evidence against this hypothesis if the loo-based model comparison favors the simpler model.
Hypothesis 3: “Bayes Factor utility” is the best single-factor predictor of relevance ratings
The third hypothesis is that the bayes_factor_utility is a better (single-factor, linear) predictor of relevance_sliderResponse than kl_utility and entropy_change. We address this hypothesis with LOO cross-validation. We also directly include the exploratory hypothesis 1 here, thus comparing all single-factor models.
# Comparing the best model to the second best modelmyvars["hyp3BFvsBFBetaLOODiff"] <--1* (loo_comp_hyp3[2,1] |>round(3))myvars["hyp3BFvsBFBetaLOOSE"] <- loo_comp_hyp3[2,2] |>round(3)myvars["hyp3BFvsBFBetaPValue"] <-Lambert_test(loo_comp_hyp3) |>round(3)
Yes, there is a noteworthy difference.
We can conclude that first-order BF is the single best predictor of the relevance ratings.
<<<<<<< HEAD
Now we want to visualize all results:
Code
library(viridis)
Code
d |>ggplot(aes(x = prior_sliderResponse, y = posterior_sliderResponse, color = relevance_sliderResponse)) +geom_point(size =3, alpha =0.5) +scale_color_viridis()
And let’s zoom in on some interesting spots:
Code
d |>ggplot(aes(x = prior_sliderResponse, y = posterior_sliderResponse, color = relevance_sliderResponse)) +geom_jitter(width =0.01, height =0.01, size =3, alpha =0.7) +scale_color_viridis() +coord_cartesian(xlim =c(0, 0.25), ylim =c(0, 0.25))
Note: redo these plots but show delta of relevance_sliderResponse and BF etc.
And we do the same thing with the commitment ratings TODO: Sanity check whether condition with high bias prior and contradictory answer lead to decrease in commitment.
Code
d |>ggplot(aes(x = prior_confidence, y = posterior_confidence, color = relevance_sliderResponse)) +geom_jitter(width =0.4, height =0.4, size =3, alpha =0.4) +scale_color_viridis()
And now we can plot delta commitment and delta probability as x and y TODO: try removing non-answers from these (and other) plots
Code
d |>ggplot(aes(x = first_order_belief_change, y = second_order_belief_change, color = relevance_sliderResponse)) +geom_jitter(width =0.0, height =0.3, size =3, alpha =0.4) +scale_color_viridis()
Code
d |>ggplot(aes(x = first_order_belief_change, y = second_order_belief_change, color = relevance_sliderResponse)) +geom_jitter(width =0.01, height =0.3, size =3, alpha =0.4) +scale_color_viridis() +coord_cartesian(xlim =c(-0.01, 0.15), ylim =c(-0.3, 4))
======= # Addressing the exploratory hypotheses
Exploratory Hypothesis 2: adding pure_second_order_belief_change to all first-order measures
To complement hypothesis 2, we also test whether adding another measure of higher-order uncertainty change adds predictive performance to each first-order measure of belief change. So here we compare, for each measure \(X\) (“entropy change”, “KL”, and “Bayes factor”) for first-order belief change, whether adding the factor pure_second_order_belief_change increases the predictive performance. Concretely, we compare a model with single factor \(X\) as a predictor to a model with predictors \(X\), pure_second_order_belief_change and their interaction. For ease of fitting, no random effects are included.
Exploratory Hypothesis 3: compare all combinations of first- and second-order measures
Finally, we just compare models with all combinations of first- and second-order measures. Questions of interest are:
Which arbitrary combination of first- and second-order measures is the best?
Does it matter to be consistent in the choice of first- and second order measure, i.e., is the performance of “first-order X” always most boosted when we supply it with “second-order X” instead of some other “second-order Y”?
Let’s run the models first. For ease of fitting, no random effects are included.
It seems that the overall best model in this comparison is the one that uses Bayes-factor based measures consistently. The second-order Bayes-factor based measure also seems to be the best to add to the other first-order measures. This also means that it is not the case the “being consisten” in choic of first- and second-order measure is always best.